Added support for JSON containing multiple events #2545

Merged
sebix merged 3 commits into certtools:develop from tim427:develop
Sep 4, 2025
Conversation

Contributor

@tim427 tim427 commented Dec 11, 2024

Currently, intelmq.bots.parsers.json.parser can only parse either a single event in JSON or multiple events in JSON, each on its own line.

This PR adds an option to parse multiple events contained in a single JSON array, by adding the multiple_events (boolean) parameter to the config.
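The three input layouts described above can be sketched as follows. This is only an illustration, not the bot's code: the helper name `split_events` is hypothetical, and only the parameter names `splitlines` and `multiple_events` are taken from this PR.

```python
import json


def split_events(raw: str, splitlines: bool = False, multiple_events: bool = False) -> list:
    """Return one dict per event for the three input layouts (hypothetical helper)."""
    if multiple_events:
        # The whole payload is one JSON array of event objects.
        return list(json.loads(raw))
    if splitlines:
        # One JSON object per non-empty line.
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    # The whole payload is a single JSON object.
    return [json.loads(raw)]
```

With `multiple_events=True`, an input such as `[{"source.ip": "127.1.2.1"}, {"source.ip": "127.1.2.2"}]` would yield two event dictionaries.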

@sebix sebix self-assigned this Dec 16, 2024
Member

@sebix sebix left a comment

Do you have an example feed at hand (so we can extract an example for the tests, add it to the docs)? To my knowledge no documented feed is using such a format.

report = self.receive_message()
if self.splitlines:
if self.multiple_events:
lines = [json.dumps(event) for event in json.loads(base64_decode(report['raw']))]
Member

Converting the data back and forth appears to be inefficient.

Contributor Author

Yeah, I can imagine. Any tips on how to do this properly?

This PR is currently running in our production and works just fine, but I agree that the double JSON conversion isn't the most efficient way to do this.

Contributor

I'd suggest just loading the JSON here, and then in L33 checking the config (or maybe just the object type?), and using MessageFactory.from_dict if we have already loaded the data :)
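A minimal sketch of that idea, assuming the raw report is a base64-encoded JSON document: the payload is decoded and parsed once, and the resulting dictionaries are handed on directly instead of being serialized again with `json.dumps`. The function name `iter_events` is hypothetical, and the standard-library `b64decode` stands in for the `base64_decode` helper used in the diff.

```python
import json
from base64 import b64decode, b64encode


def iter_events(raw_b64: str, multiple_events: bool):
    """Yield parsed event dicts, decoding the JSON exactly once (hypothetical helper)."""
    data = json.loads(b64decode(raw_b64).decode("utf-8"))
    if multiple_events:
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of events")
        # Each list element is already a dict, so no re-serialization is needed.
        yield from data
    else:
        yield data


# Build a sample report payload and iterate over its events.
payload = b64encode(
    json.dumps([{"source.ip": "127.1.2.1"}, {"source.ip": "127.1.2.2"}]).encode("utf-8")
).decode("ascii")
events = list(iter_events(payload, multiple_events=True))
```

Each yielded dictionary could then be passed to a dict-based constructor, which is what the suggestion above proposes.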

Contributor Author

tim427 commented Dec 16, 2024

Do you have an example feed at hand (so we can extract an example for the tests, add it to the docs)? To my knowledge no documented feed is using such a format.

Our National Cyber Security Centre (NCSC) sends us "IntelMQ JSONs" in a ZIP file by mail.
The ZIP-file contains a single JSON-file.

Here's an example (I tried to anonymise most values):

[
    {
        "extra.dataset_collections": "0",
        "extra.dataset_files": "1",
        "extra.dataset_infected": "false",
        "extra.dataset_ransom": "null",
        "extra.dataset_rows": "0",
        "extra.dataset_size": "301",
        "protocol.application": "https",
        "protocol.transport": "tcp",
        "source.asn": 12345689,
        "source.fqdn": "fqdn-example-1.tld",
        "source.geolocation.cc": "NL",
        "source.geolocation.city": "Enschede",
        "source.geolocation.latitude": 52.0000000000000,
        "source.geolocation.longitude": 6.0000000000000,
        "source.geolocation.region": "Overijssel",
        "source.ip": "127.1.2.1",
        "source.network": "127.1.0.0/16",
        "source.port": 80,
        "time.source": "2024-12-16T02:08:06+00:00"
    },
    {
        "extra.dataset_collections": "0",
        "extra.dataset_files": "1",
        "extra.dataset_infected": "false",
        "extra.dataset_ransom": "null",
        "extra.dataset_rows": "0",
        "extra.dataset_size": "615",
        "extra.os_name": "Ubuntu",
        "extra.software": "Apache",
        "extra.tag": "rescan",
        "extra.version": "2.4.58",
        "protocol.application": "https",
        "protocol.transport": "tcp",
        "source.asn": 12345689,
        "source.fqdn": "fqdn-example-2.tld",
        "source.geolocation.cc": "NL",
        "source.geolocation.city": "Eindhoven",
        "source.geolocation.latitude": 51.0000000000000,
        "source.geolocation.longitude": 5.0000000000000,
        "source.geolocation.region": "North Brabant",
        "source.ip": "127.1.2.2",
        "source.network": "127.1.0.0/16",
        "source.port": 443,
        "time.source": "2024-12-16T02:08:12+00:00"
    },
    {
        "extra.dataset_collections": "0",
        "extra.dataset_files": "1",
        "extra.dataset_infected": "false",
        "extra.dataset_ransom": "null",
        "extra.dataset_rows": "0",
        "extra.dataset_size": "421",
        "protocol.application": "http",
        "protocol.transport": "tcp",
        "source.asn": 12345689,
        "source.geolocation.cc": "NL",
        "source.geolocation.city": "Enschede",
        "source.geolocation.latitude": 52.0000000000000,
        "source.geolocation.longitude": 6.0000000000000,
        "source.geolocation.region": "Overijssel",
        "source.ip": "127.1.2.3",
        "source.network": "127.1.0.0/16",
        "source.port": 9000,
        "time.source": "2024-12-15T21:09:49+00:00"
    }
]

@tim427 tim427 closed this Apr 22, 2025
@sebix sebix reopened this Apr 23, 2025
@sebix sebix added this to the 3.5.0 milestone Apr 23, 2025
Member

sebix commented Aug 14, 2025

I added a few changes here:

  • Test cases
    • that's where I noticed that the json parser didn't add the required classification.type field if it doesn't exist in input data, so added that as well
  • The optimizations as discussed above
    • Which also revealed another bug in Message.from_dict which modified the parameter
  • add documentation
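The two fixes mentioned in this list (adding a default classification.type and not modifying the caller's dictionary) can be illustrated with a small sketch. The helper `with_defaults` is hypothetical and is not the code in intelmq.lib.message; the field names `__type` and `classification.type` and the value `undetermined` are taken from the commit messages of this PR, while the default type `"Event"` is an assumption.

```python
def with_defaults(event: dict, default_type: str = "Event") -> dict:
    """Return a new dict with defaults filled in; the caller's dict is left untouched."""
    # Shallow copy, so keys added below do not leak back into the argument.
    result = dict(event)
    # Only set the message type if the input did not provide one.
    result.setdefault("__type", default_type)
    # Required field: fall back to "undetermined" if the input lacks it.
    result.setdefault("classification.type", "undetermined")
    return result


original = {"source.ip": "127.1.2.1"}
filled = with_defaults(original)
```

After the call, `original` still has only its single key, while `filled` carries the added fields.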

@sebix sebix force-pushed the develop branch 3 times, most recently from f9d8d81 to 9215f92 Compare August 14, 2025 08:43
Member

sebix commented Aug 14, 2025

...and found & fixed another bug in intelmq.lib.message.Message.from_dict:
Raise a ValueError if message type is not determinable
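As a rough illustration of that behaviour (not the actual implementation), a type lookup could fail loudly when neither the data nor a default provides the message type. The function name `resolve_message_type` and the `default_type` parameter are assumptions made for this sketch.

```python
from typing import Optional


def resolve_message_type(message: dict, default_type: Optional[str] = None) -> str:
    """Return the message type, or raise ValueError if it cannot be determined."""
    msg_type = message.get("__type", default_type)
    if msg_type is None:
        raise ValueError(
            "Message type could not be determined: "
            "no '__type' field in the data and no default type given."
        )
    return msg_type
```

Raising early gives the caller a clear error instead of a failure further down the processing chain.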

Member

sebix commented Aug 14, 2025

As I wrote a major part of this PR, I won't merge it myself

@aaronkaplan could you do the review instead?

@sebix sebix added the feature Indicates new feature requests or new features label Aug 18, 2025
@sebix sebix requested a review from aaronkaplan August 20, 2025 07:24
@sebix sebix modified the milestones: 3.6.0, 3.5.0 Feature Release Aug 20, 2025
Member

sebix commented Aug 25, 2025

Rebased on develop to fix conflicts

Member

sebix commented Aug 29, 2025

@kamil-certat maybe you can have a look?

Contributor

@kamil-certat kamil-certat left a comment

It looks good except for the unhandled case of both flags being set. Please either handle it or explicitly forbid it.

lines = base64_decode(report["raw"]).splitlines()
else:
lines = [base64_decode(report['raw'])]
lines = [base64_decode(report["raw"])]
Contributor

By having two flags and no enforcement that they are mutually exclusive, we allow setting all combinations, including

splitlines = True
multiple_events = True

which the code does not support, although it is a theoretically possible case: multiple lines, each containing a list of event dictionaries.

I'd suggest one of the following solutions:

  1. support that case,
  2. switch from two flags to a single format (or similar) configuration option that supports values such as single, splitlines, multiple,
  3. add a config validation that forbids starting a bot with both flags set.

Member

Thanks for catching this.
I chose the easiest option: Forbidding it.
There's no reason to implement it; switching to a different mode setting would also increase the required effort, and require upgrade functions. We can always make the switch to a format parameter later with the same effort as now.
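A minimal sketch of what forbidding the combination could look like, assuming the check runs once at bot start-up. The function name `validate_json_parser_flags` is hypothetical; how the merged code actually enforces this is not shown in this conversation.

```python
def validate_json_parser_flags(splitlines: bool, multiple_events: bool) -> None:
    """Reject the unsupported combination of both flags (hypothetical helper)."""
    if splitlines and multiple_events:
        raise ValueError(
            "The parameters 'splitlines' and 'multiple_events' are mutually "
            "exclusive; enable at most one of them."
        )


# Any single flag, or neither, passes silently.
validate_json_parser_flags(splitlines=True, multiple_events=False)
validate_json_parser_flags(splitlines=False, multiple_events=True)
```

Failing at start-up keeps the two boolean parameters for now and leaves room to move to a single format parameter later.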

tim427 and others added 3 commits September 4, 2025 15:50
add test cases for multiple events mode
optimize runtime for multiple events mode
add documentation
add classification.type = undetermined if input data does not contain
the field
fix bugs in intelmq.lib.message.Message.from_dict:
Do not modify the dict parameter by adding the `__type` field
Raise a ValueError if message type is not determinable
@sebix sebix merged commit d2cac08 into certtools:develop Sep 4, 2025
20 checks passed
Labels

component: bots feature Indicates new feature requests or new features

4 participants