Add details about data store file types #62

Open
stroomdev66 opened this issue Nov 25, 2021 · 1 comment

@stroomdev66

Related to this section of the user guide:
https://gchq.github.io/stroom-docs/hugo-docsy/docs/user-guide/concepts/streams/

A stream is either a single piece of data or several pieces that are joined together for the sake of efficient storage and processing.

Files ending *.mf.dat are manifest files. Each is a plain-text file that you can open and read; it describes the stream as a whole, i.e. its high-level attributes rather than the individual entries.

All other files are either block GZIP data (*.bgz) or an index (*.bdy.dat and *.seg.dat).

BGZ files are a series of GZIP chunks of data appended together.
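As a rough illustration of what "GZIP chunks appended together" means in general, the sketch below compresses two pieces of data separately and concatenates the results. It uses only Python's standard `gzip` module and does not reproduce Stroom's actual on-disk layout, which may include additional framing; the sample data is made up.

```python
import gzip

# Two independently compressed chunks, standing in for the parts of a stream.
chunk_a = gzip.compress(b"first event\n")
chunk_b = gzip.compress(b"second event\n")

# Appending the chunks yields a single byte sequence that is still valid gzip.
combined = chunk_a + chunk_b

# Python's gzip module reads all concatenated members in order.
restored = gzip.decompress(combined)
print(restored.decode())
```

Because each chunk is compressed on its own, a reader that knows where a chunk starts can decompress it without touching the others.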

The index files are a series of byte offsets stored as Java long values (8 bytes per number) that tell Stroom where the split points between the GZIP chunks are.
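A minimal sketch of decoding such an index might look like the following. It assumes the longs are written back to back in big-endian byte order (the convention of Java's `DataOutputStream`), which this issue does not confirm; `read_offsets` and the sample values are hypothetical.

```python
import struct

LONG_SIZE = 8  # a Java long occupies 8 bytes


def read_offsets(index_bytes: bytes) -> list[int]:
    """Decode consecutive 8-byte signed integers into a list of offsets.

    Assumption: big-endian byte order, as Java's DataOutputStream writes.
    """
    if len(index_bytes) % LONG_SIZE != 0:
        raise ValueError("index length is not a multiple of 8 bytes")
    count = len(index_bytes) // LONG_SIZE
    return list(struct.unpack(f">{count}q", index_bytes))


# Hypothetical index holding three offsets.
sample_index = struct.pack(">3q", 0, 52, 131)
offsets = read_offsets(sample_index)
print(offsets)
```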

You will only see the *.seg.dat index files alongside processed data that is configured to segment its output. Segmenting the output means that an index is written that allows the system to seek to a specific event without decompressing the whole stream. Instead, it decompresses only the appropriate chunk and reads the event straight from that byte position.
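The idea of seeking via an index can be sketched as below. This is a simplification under stated assumptions, not Stroom's implementation: each event is placed in its own GZIP chunk, the index holds the starting byte offset of each chunk as a big-endian 8-byte long, and `build_stream` and `read_chunk` are hypothetical helper names.

```python
import gzip
import struct


def build_stream(events):
    """Compress each event separately and record where each chunk starts."""
    data = b""
    offsets = []
    for event in events:
        offsets.append(len(data))
        data += gzip.compress(event)
    # Assumption: offsets are stored as big-endian 8-byte longs.
    index = struct.pack(f">{len(offsets)}q", *offsets)
    return data, index


def read_chunk(data, index, n):
    """Decompress only chunk n, located via the offset index."""
    count = len(index) // 8
    offsets = struct.unpack(f">{count}q", index)
    start = offsets[n]
    # The chunk ends where the next one starts, or at the end of the data.
    end = offsets[n + 1] if n + 1 < count else len(data)
    return gzip.decompress(data[start:end])


data, index = build_stream([b"event 0\n", b"event 1\n", b"event 2\n"])
second = read_chunk(data, index, 1)
print(second.decode())
```

Only the bytes between two neighbouring offsets are decompressed, so the cost of reading one event does not grow with the size of the whole stream.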

In addition to these file types, the extension contains a further part that indicates the type of data stored in the BGZ file. These are as follows:

RAW_EVENTS, "revt"
RAW_REFERENCE, "rref"
EVENTS, "evt"
REFERENCE, "ref"
TEST_EVENTS, "tevt"
TEST_REFERENCE, "tref"
META, "meta"
ERROR, "err"
CONTEXT, "ctx"
DETECTIONS, "dtxn"
RECORDS, "rec"

Some of these may not exist at all, as we moved away from creating an extension for each stream type. Some were also experimental.
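For anyone scripting against a data store directory, the list above could be captured as a lookup table along these lines. The mapping is copied from the list; the function `stream_types_in`, the assumption that the type appears as a dot-separated part of the file name, and the example file names are all hypothetical.

```python
# Stream type name -> extension part, as listed in this issue.
STREAM_TYPE_EXTENSIONS = {
    "RAW_EVENTS": "revt",
    "RAW_REFERENCE": "rref",
    "EVENTS": "evt",
    "REFERENCE": "ref",
    "TEST_EVENTS": "tevt",
    "TEST_REFERENCE": "tref",
    "META": "meta",
    "ERROR": "err",
    "CONTEXT": "ctx",
    "DETECTIONS": "dtxn",
    "RECORDS": "rec",
}

# Reverse lookup: extension part -> stream type name.
EXTENSION_TO_TYPE = {ext: name for name, ext in STREAM_TYPE_EXTENSIONS.items()}


def stream_types_in(filename):
    """Return the stream type names found among the dot-separated parts.

    Assumption: the type is encoded as one dot-separated part of the name.
    """
    parts = filename.split(".")
    return [EXTENSION_TO_TYPE[p] for p in parts if p in EXTENSION_TO_TYPE]


# Hypothetical file names, for illustration only.
print(stream_types_in("example.revt.bgz"))
print(stream_types_in("example.revt.meta.bgz"))
```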

