Changes for BoefjeScheduler to support deduplication #4309

Draft
wants to merge 8 commits into
base: main
from

Conversation

jpbruinsslot
Contributor

@jpbruinsslot jpbruinsslot commented Apr 10, 2025

Warning

This is a pre-review and should not be merged

Changes

Before we push a task onto the queue, we create additional tasks for other organisations that have the same OOI present. We batch those tasks by the hash of their environment settings, so that the task runner can pick them up in one go and de-duplicate similar scans of the same OOI.

  • Check if the OOI is present in other organisations
  • Get the boefje for that organisation
  • Check if the env_hash matches; if so, a task is created and marked as a de-duplicated task
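
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Task`, `hash_env`, `deduplicated_tasks`, and the two lookup dictionaries are hypothetical stand-ins for the scheduler's models and its calls to other services, and hashing the settings as sorted JSON with SHA-256 is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical, simplified stand-in for a boefje task."""
    organisation: str
    boefje_id: str
    input_ooi: str
    env_hash: str
    deduplicated: bool = False


def hash_env(settings):
    """Hash environment settings deterministically (assumed scheme: sorted JSON + SHA-256)."""
    payload = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicated_tasks(task, oois_by_org, settings_by_org):
    """Create de-duplicated tasks for other organisations that share the OOI.

    oois_by_org:     organisation -> set of OOI primary keys (assumed lookup)
    settings_by_org: (organisation, boefje_id) -> settings dict; a missing
                     entry means the boefje is not available in that organisation
    """
    extra = []
    for org, oois in oois_by_org.items():
        # Step 1: is the OOI present in another organisation?
        if org == task.organisation or task.input_ooi not in oois:
            continue
        # Step 2: get the boefje (here: its settings) for that organisation.
        settings = settings_by_org.get((org, task.boefje_id))
        if settings is None:
            continue
        # Step 3: only matching environment hashes can share one scan.
        if hash_env(settings) == task.env_hash:
            extra.append(
                Task(org, task.boefje_id, task.input_ooi, task.env_hash, deduplicated=True)
            )
    return extra
```

Organisations whose settings hash differs are skipped here; they would still need their own regular task.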

Issue link

Closes #4359

QA notes

Please add some information for QA on how to test the newly created code.


Code Checklist

  • All the commits in this PR are properly PGP-signed and verified.
  • This PR only contains functionality relevant to the issue.
  • I have written unit tests for the changes or fixes I made.
  • I have checked the documentation and made changes where necessary.
  • I have performed a self-review of my code and refactored it to the best of my abilities.
  • Tickets have been created for newly discovered issues.
  • For any non-trivial functionality, I have added integration and/or end-to-end tests.
  • I have informed others of any required changes to .env files and changed the .env-dist accordingly.
  • I have included comments in the code to elaborate on what is not self-evident from the code itself, including references to issues and discussions online, or implicit behavior of an interface.

Checklist for code reviewers:

Copy-paste the checklist from the docs/source/templates folder into your comment.


Checklist for QA:

Copy-paste the checklist from the docs/source/templates folder into your comment.

@jpbruinsslot jpbruinsslot added the mula Issues related to the scheduler label Apr 10, 2025
@jpbruinsslot jpbruinsslot self-assigned this Apr 10, 2025
@jpbruinsslot jpbruinsslot added this to KAT Apr 10, 2025
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Apr 10, 2025
@jpbruinsslot jpbruinsslot moved this from Incoming features / Need assessment to In Progress in KAT Apr 10, 2025
# Is the ooi in other organisations? When the ooi is in other
# organisations we need to create tasks for those organisations as well.
if boefje_task.input_ooi is not None:
    # FIXME: is the ooi shared between organisations, or is it a copy?
Contributor

OOI information is not shared between organisations. However, since the OOIs are the same (by primary key), most of their information is exactly the same. Small differences might still exist in properties that are present on the OOI but are not used in the primary_key.
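
To illustrate the point, here is a small sketch of two per-organisation copies that share a primary key but differ in a property outside it. The `Hostname` class, its fields, and the `Type|network|name` key layout are assumptions for illustration only, not the actual OOI models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hostname:
    """Hypothetical OOI; only network and name feed the primary key."""
    network: str
    name: str
    dns_zone: Optional[str] = None  # assumed property that is NOT part of the key

    @property
    def primary_key(self) -> str:
        return f"Hostname|{self.network}|{self.name}"


def differing_properties(a, b):
    """List the properties on which two copies of the same OOI differ."""
    if a.primary_key != b.primary_key:
        raise ValueError("not the same OOI: primary keys differ")
    return sorted(k for k in vars(a) if getattr(a, k) != getattr(b, k))
```

A de-duplicated scan would therefore run against one copy, while the other organisations' copies may carry slightly different non-key properties.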


# FIXME: what if boefje is disabled in the other org? Are we sure
# that in that org the boefje is allowed to scan the ooi?
# FIXME: what status do we give it? Do we give it completed? We
Contributor

I'd say a new state Deduplicated would fit the bill? We could then change it to 'complete' or 'error' once the actual running job completes/errors.
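
A rough sketch of what that could look like, assuming a status enum along these lines. The member names and the `resolve_deduplicated` helper are illustrative and not the scheduler's actual API.

```python
from enum import Enum


class TaskStatus(Enum):
    """Illustrative statuses; DEDUPLICATED is the proposed new state."""
    QUEUED = "queued"
    RUNNING = "running"
    DEDUPLICATED = "deduplicated"
    COMPLETED = "completed"
    FAILED = "failed"


def resolve_deduplicated(statuses, primary_result):
    """Give every DEDUPLICATED task the final status of the job that actually ran.

    Tasks in any other state are left untouched.
    """
    if primary_result not in (TaskStatus.COMPLETED, TaskStatus.FAILED):
        raise ValueError("the primary task has not finished yet")
    return [
        primary_result if status is TaskStatus.DEDUPLICATED else status
        for status in statuses
    ]
```

Since the task runner only picks up queued tasks, a task parked in the new state would not be run a second time.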

# FIXME: what status do we give it? Do we give it completed? We
# can't make it queued since it shouldn't be picked up by
# the task runner.
# FIXME: how do these tasks get associated with the other raw files?
Contributor

I think we discussed uploading the raw files for each job, and as such we'd create normalizer jobs in each subsequent organization (which we might deduplicate later on).
Uploaded raw files with the same content can transparently be deduplicated (by content hash) in Bytes on disk.
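
As a sketch of the content-hash idea: each job keeps its own reference, while identical payloads are stored only once. `ContentStore` is a hypothetical in-memory stand-in and does not reflect how Bytes actually stores files; SHA-256 is an assumed hash choice.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store that de-duplicates payloads by content hash."""

    def __init__(self):
        self._blobs = {}  # content hash -> raw bytes, stored once
        self._refs = {}   # job id -> content hash

    def upload(self, job_id, raw):
        """Record a raw file for a job; identical content reuses the stored blob."""
        digest = hashlib.sha256(raw).hexdigest()
        self._blobs.setdefault(digest, raw)
        self._refs[job_id] = digest
        return digest

    def get(self, job_id):
        """Return the raw bytes that were uploaded for a job."""
        return self._blobs[self._refs[job_id]]

    def blob_count(self):
        """Number of distinct payloads actually stored."""
        return len(self._blobs)
```

Every organisation's job can then still look up its own raw file, so normalizer jobs can be created per organisation without extra storage cost.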

@jpbruinsslot jpbruinsslot linked an issue Apr 22, 2025 that may be closed by this pull request
4 tasks
@jpbruinsslot jpbruinsslot removed this from KAT Apr 23, 2025
* main: (23 commits)
  Updated some packages (#4364)
  Update URL to docs in makefile (#4346)
  Fix/catch information source errors when filling/updating the rocky knowledge base (#4347)
  Translations update from Hosted Weblate (#4363)
  Styling changes to meet the design (#4263)
  Fix scheduled reports view showing reports for all organizations (#4351)
  remove unneeded task statistics for generic task showing pages (#4344)
  Translations update from Hosted Weblate (#4353)
  Fix weblate by merging all pending translations (#4348)
  Shows the current plugin state to users who cannot enable/disable plugins themselves. (#4326)
  Fix broken normaliser list view link in plugins.html (#4331)
  Fixes toc layout on the docs (#4341)
  Updated `django_compressor` (#4342)
  Update QA testplan to add multiple organizations (#4338)
  Ignore incorrect type assumption from mypy (#4337)
  Change OOI types for findings report (#4184)
  Findings dashboard for all organizations (#4007)
  Update kat_finding_types.json, add more in dept details (#4316)
  Add changes from #4312 (#4319)
  Python 3.10 compatibility for datetime parsing in report flow (#4302)
  ...
@jpbruinsslot jpbruinsslot added this to KAT Apr 23, 2025
@github-project-automation github-project-automation bot moved this to Incoming features / Need assessment in KAT Apr 23, 2025
@jpbruinsslot jpbruinsslot moved this from Incoming features / Need assessment to Review in KAT Apr 23, 2025
@jpbruinsslot jpbruinsslot requested a review from Donnype April 23, 2025 09:50
@jpbruinsslot jpbruinsslot removed their assignment Apr 29, 2025
@jpbruinsslot jpbruinsslot changed the title Changes for scheduler to support deduplication Changes for BoefjeScheduler to support deduplication Apr 30, 2025
Labels
mula Issues related to the scheduler
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

Implement deduplication functionality in BoefjeScheduler
2 participants