Added support for JSON containing multiple events by tim427 · Pull Request #2545 · certtools/intelmq

tim427 · 2024-12-11T13:53:55Z

Currently, intelmq.bots.parsers.json.parser can only parse either a single event in JSON, or multiple events in JSON with each event on its own line.

This PR adds an option to parse multiple events contained in a single JSON list, by adding the multiple_events (boolean) parameter to the config.
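To make the three input shapes concrete, here is a minimal stdlib-only sketch. The function `split_raw` and its keyword arguments are hypothetical and only mirror the parameter names from this PR; it is not the parser bot's actual code.

```python
import json


def split_raw(raw: str, splitlines: bool = False, multiple_events: bool = False) -> list:
    """Return one dict per event for the three input shapes discussed here.

    Hypothetical helper for illustration, not part of IntelMQ.
    """
    if multiple_events:
        # One JSON document holding a list of event objects
        return list(json.loads(raw))
    if splitlines:
        # One JSON object per line; blank lines are skipped
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    # Default: the whole input is a single JSON object
    return [json.loads(raw)]


single = '{"source.ip": "127.1.2.1"}'
per_line = '{"source.ip": "127.1.2.1"}\n{"source.ip": "127.1.2.2"}'
as_list = '[{"source.ip": "127.1.2.1"}, {"source.ip": "127.1.2.2"}]'

print(len(split_raw(single)))
print(len(split_raw(per_line, splitlines=True)))
print(len(split_raw(as_list, multiple_events=True)))
```

Only the last call corresponds to the new mode; the first two are the shapes the parser already handled.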

sebix

Do you have an example feed at hand (so we can extract an example for the tests, add it to the docs)? To my knowledge no documented feed is using such a format.

sebix · 2024-12-16T08:07:32Z

intelmq/bots/parsers/json/parser.py

        report = self.receive_message()
-        if self.splitlines:
+        if self.multiple_events:
+            lines = [json.dumps(event) for event in json.loads(base64_decode(report['raw']))]


Converting the data back and forth appears to be inefficient.

Yeah, I can imagine that. Any tips on how to do this properly?

Currently this PR is running in our production and works just fine, but I agree that the double JSON conversion isn't the most efficient way to do this.

I'd suggest just loading the data here, then in L33 checking the config (or maybe just the object type?), and using MessageFactory.from_dict if we have already loaded the data :)
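A rough sketch of the difference, using only the standard library. The function names are hypothetical, and `to_event` merely stands in for the later step that would decide between parsing text and using an already loaded dict; the real bot works with IntelMQ's own helpers.

```python
import base64
import json


def events_round_trip(raw_b64: str) -> list:
    # Approach from the diff: parse, re-serialize each event, then parse again later
    lines = [json.dumps(event) for event in json.loads(base64.b64decode(raw_b64))]
    return [json.loads(line) for line in lines]


def events_load_once(raw_b64: str) -> list:
    # Suggested approach: parse once and pass the dicts on directly
    return json.loads(base64.b64decode(raw_b64))


def to_event(item):
    # Check the object type: only parse when the item is still text
    return item if isinstance(item, dict) else json.loads(item)


payload = [{"source.ip": "127.1.2.1"}, {"source.ip": "127.1.2.2"}]
raw_b64 = base64.b64encode(json.dumps(payload).encode()).decode()

# Both approaches yield the same events; the second skips one dumps/loads pair per event
print(events_round_trip(raw_b64) == events_load_once(raw_b64))
```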

tim427 · 2024-12-16T11:02:13Z

> Do you have an example feed at hand (so we can extract an example for the tests, add it to the docs)? To my knowledge no documented feed is using such a format.

Our National Cyber Security Centre (NCSC) sends us "IntelMQ JSONs" in a ZIP file by mail.
The ZIP file contains a single JSON file.

Here's an example (I tried to anonymise most values):

[
    {
        "extra.dataset_collections": "0",
        "extra.dataset_files": "1",
        "extra.dataset_infected": "false",
        "extra.dataset_ransom": "null",
        "extra.dataset_rows": "0",
        "extra.dataset_size": "301",
        "protocol.application": "https",
        "protocol.transport": "tcp",
        "source.asn": 12345689,
        "source.fqdn": "fqdn-example-1.tld",
        "source.geolocation.cc": "NL",
        "source.geolocation.city": "Enschede",
        "source.geolocation.latitude": 52.0000000000000,
        "source.geolocation.longitude": 6.0000000000000,
        "source.geolocation.region": "Overijssel",
        "source.ip": "127.1.2.1",
        "source.network": "127.1.0.0/16",
        "source.port": 80,
        "time.source": "2024-12-16T02:08:06+00:00"
    },
    {
        "extra.dataset_collections": "0",
        "extra.dataset_files": "1",
        "extra.dataset_infected": "false",
        "extra.dataset_ransom": "null",
        "extra.dataset_rows": "0",
        "extra.dataset_size": "615",
        "extra.os_name": "Ubuntu",
        "extra.software": "Apache",
        "extra.tag": "rescan",
        "extra.version": "2.4.58",
        "protocol.application": "https",
        "protocol.transport": "tcp",
        "source.asn": 12345689,
        "source.fqdn": "fqdn-example-2.tld",
        "source.geolocation.cc": "NL",
        "source.geolocation.city": "Eindhoven",
        "source.geolocation.latitude": 51.0000000000000,
        "source.geolocation.longitude": 5.0000000000000,
        "source.geolocation.region": "North Brabant",
        "source.ip": "127.1.2.2",
        "source.network": "127.1.0.0/16",
        "source.port": 443,
        "time.source": "2024-12-16T02:08:12+00:00"
    },
    {
        "extra.dataset_collections": "0",
        "extra.dataset_files": "1",
        "extra.dataset_infected": "false",
        "extra.dataset_ransom": "null",
        "extra.dataset_rows": "0",
        "extra.dataset_size": "421",
        "protocol.application": "http",
        "protocol.transport": "tcp",
        "source.asn": 12345689,
        "source.geolocation.cc": "NL",
        "source.geolocation.city": "Enschede",
        "source.geolocation.latitude": 52.0000000000000,
        "source.geolocation.longitude": 6.0000000000000,
        "source.geolocation.region": "Overijssel",
        "source.ip": "127.1.2.3",
        "source.network": "127.1.0/16",
        "source.port": 9000,
        "time.source": "2024-12-15T21:09:49+00:00"
    }
]

sebix · 2025-08-14T08:03:33Z

I added a few changes here:

- Test cases
  - That's where I noticed that the JSON parser didn't add the required classification.type field if it doesn't exist in the input data, so I added that as well
- The optimizations as discussed above
  - Which also revealed another bug in Message.from_dict, which modified the parameter
- Documentation
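The classification.type fallback could look roughly like the sketch below. The helper name is hypothetical; the default value `undetermined` is the one named in the commit message of this PR.

```python
def with_default_classification(event: dict) -> dict:
    """Return a copy of the event with classification.type set if it was missing.

    Hypothetical helper for illustration, not the bot's actual code.
    """
    result = dict(event)
    # Only fill the field when the input data did not provide it
    result.setdefault("classification.type", "undetermined")
    return result


missing = {"source.ip": "127.1.2.1"}
present = {"source.ip": "127.1.2.2", "classification.type": "scanner"}

print(with_default_classification(missing)["classification.type"])
print(with_default_classification(present)["classification.type"])
```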

sebix · 2025-08-14T08:44:05Z

...and found & fixed another bug in intelmq.lib.message.Message.from_dict:
Raise a ValueError if message type is not determinable
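The two from_dict fixes can be illustrated with a simplified stand-in. The function `from_dict_sketch` and its `default_type` argument are assumptions for this sketch and do not reproduce the library's implementation; only the `__type` field name and the ValueError come from this PR.

```python
from typing import Optional


def from_dict_sketch(message: dict, default_type: Optional[str] = None) -> dict:
    """Simplified stand-in showing both fixes, not IntelMQ's real method."""
    # Fix 1: work on a copy so the caller's dict is never modified
    data = dict(message)
    if "__type" not in data:
        if default_type is None:
            # Fix 2: fail loudly when the message type cannot be determined
            raise ValueError("Message type could not be determined.")
        data["__type"] = default_type
    return data


original = {"source.ip": "127.1.2.1"}
built = from_dict_sketch(original, default_type="Event")

print("__type" in original)  # the input dict stays untouched
print(built["__type"])
```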

sebix · 2025-08-14T09:32:01Z

As I wrote a major part of this PR, I won't merge it myself

@aaronkaplan could you do the review instead?

sebix · 2025-08-25T18:36:07Z

Rebased on develop to fix conflicts

sebix · 2025-08-29T11:40:23Z

@kamil-certat maybe you can have a look?

kamil-certat

It looks good except for the unhandled case of both flags being set. Please either handle it or explicitly forbid it.

kamil-certat · 2025-09-04T13:34:11Z

intelmq/bots/parsers/json/parser.py

+            lines = base64_decode(report["raw"]).splitlines()
        else:
-            lines = [base64_decode(report['raw'])]
+            lines = [base64_decode(report["raw"])]


By having two flags and no enforcement that they are mutually exclusive, we allow setting all combinations, including

splitlines = True
multiple_events = True

which is not supported by the code, but is still a theoretically possible case: multiple lines, each containing a list of multiple event dictionaries.

I'd suggest one of the following solutions:

- support that case,
- switch from two flags to a single format (or similar) configuration parameter that supports options such as single, splitlines, multiple,
- add a config validation that forbids starting a bot with both flags set.

Thanks for catching this.
I chose the easiest option: Forbidding it.
There's no reason to implement it; switching to a different mode setting would also increase the required effort, and require upgrade functions. We can always make the switch to a format parameter later with the same effort as now.
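Forbidding the combination comes down to a check along these lines. This is only a sketch: the function name, the error type and the message text are assumptions, and where the check runs in the bot is not shown here.

```python
def check_parameters(splitlines: bool, multiple_events: bool) -> None:
    """Reject the unsupported combination of both flags (hypothetical helper)."""
    if splitlines and multiple_events:
        raise ValueError(
            "The parameters 'splitlines' and 'multiple_events' cannot be combined."
        )


# All other combinations pass silently
check_parameters(splitlines=False, multiple_events=False)
check_parameters(splitlines=True, multiple_events=False)
check_parameters(splitlines=False, multiple_events=True)

try:
    check_parameters(splitlines=True, multiple_events=True)
except ValueError as exc:
    print(f"rejected: {exc}")
```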

- add test cases for multiple events mode
- optimize runtime for multiple events mode
- add documentation
- add classification.type = undetermined if input data does not contain the field
- fix bugs in intelmq.lib.message.Message.from_dict:
  - Do not modify the dict parameter by adding the `__type` field
  - Raise a ValueError if message type is not determinable


sebix self-assigned this Dec 16, 2024

sebix added the component: bots label Dec 16, 2024

sebix reviewed Dec 16, 2024

tim427 closed this Apr 22, 2025

sebix reopened this Apr 23, 2025

sebix added this to the 3.5.0 milestone Apr 23, 2025

sebix force-pushed the develop branch 3 times, most recently from f9d8d81 to 9215f92 Compare August 14, 2025 08:43

sebix added the feature Indicates new feature requests or new features label Aug 18, 2025

sebix requested a review from aaronkaplan August 20, 2025 07:24

sebix assigned aaronkaplan Aug 20, 2025

sebix modified the milestones: 3.6.0, 3.5.0 Feature Release Aug 20, 2025

sebix force-pushed the develop branch from 9215f92 to da5ff80 Compare August 25, 2025 18:35

kamil-certat requested changes Sep 4, 2025

tim427 and others added 3 commits September 4, 2025 15:50

Added support for JSON containing multuple events

9e0e948

json parser: check for parameter incompatibility

729182a

pointed out by @kamil-certat in https://github.com/certtools/intelmq/pull/2545/files#r2322188136

sebix force-pushed the develop branch from e44c5b2 to 729182a Compare September 4, 2025 13:51

kamil-certat approved these changes Sep 4, 2025

sebix merged commit d2cac08 into certtools:develop Sep 4, 2025
20 checks passed
