Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structured Message Decoder #44140

Draft
wants to merge 5 commits into
base: feature/storage/content-validation
Choose a base branch
from

Conversation

ibrahimrabab
Copy link
Contributor

Description

Please add an informative description that covers that changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions bot added the Storage Storage Service (Queues, Blobs, Files) label Feb 11, 2025
@ibrahimrabab ibrahimrabab changed the base branch from main to feature/storage/content-validation February 11, 2025 01:17
return decodedContent.toByteArray();
}

public byte[] decodeFully() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can be an overload?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm curious if we even need this method. It will probably depend on how the download integration goes but if we don't have a strong reason for it now, let's remove it as we can always add it back if we decide we need it.

Comment on lines +11 to +15
private static final int DEFAULT_MESSAGE_VERSION = 1;
private static final int V1_HEADER_LENGTH = 13;
private static final int V1_SEGMENT_HEADER_LENGTH = 10;
private static final int CRC64_LENGTH = 8;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to put these in a constants file to share between the encoder and decoder?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea. We should try to consolidate once the encoder PR gets merged.


public StructuredMessageDecoder(ByteBuffer inputBuffer) {
this.buffer = inputBuffer.order(ByteOrder.LITTLE_ENDIAN);
readMessageHeader();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than reading the message header during initialization, I think it makes more sense to keep all reads in the decode method and just check if the message offset is 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Mainly when we initialize the class, it is not intuitive that it would already start decoding/consuming the input buffer. I'd imagine more of the expectation would be that we consume from the input buffer when decode is explicitly called.

Copy link
Member

@kyleknap kyleknap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This looks like a good start. I just had feedback on the initial API we should try to shoot for. After we get the public interfaces in a spot we like, I plan to look more closely into the tests and more into the details of the decoding implementation in follow up reviews.

private static final int CRC64_LENGTH = 8;

private int messageVersion;
private int messageLength;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably keep this a long to match the type that is supported when using structured message. Technically we should even be treating it as an unsigned long for any comparisons as well.

}

messageLength = (int) buffer.getLong();
flags = StructuredMessageFlags.fromValue(Short.toUnsignedInt(buffer.getShort()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we receive flags that are not part of our enumerated flags, I'm assuming this will be validated? It might be worth adding a test for it.


public StructuredMessageDecoder(ByteBuffer inputBuffer) {
this.buffer = inputBuffer.order(ByteOrder.LITTLE_ENDIAN);
readMessageHeader();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Mainly when we initialize the class, it is not intuitive that it would already start decoding/consuming the input buffer. I'd imagine more of the expectation would be that we consume from the input buffer when decode is explicitly called.

private long segmentCrc64 = 0;
private final Map<Integer, Long> segmentCrcs = new HashMap<>();

public StructuredMessageDecoder(ByteBuffer inputBuffer) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the constructor, I'm thinking instead of accepting an input byte buffer, we'd accept the expected content length of the individual download and the decode() method would accept a byte buffer containing a part of the entire encoded structured message. The main reason that I'm suggesting this is that I suspect in order to integrate this with the client, we will be provided a flux of byte buffers so the content of the entire encoded structured message will be broken across multiple buffers. I imagine it will be similar in integration style to the client side encryption Decryptor class (note that it accepts a Flux<ByteBuffer> but we'll just want to accept ByteBuffer for now so that Reactor integration is decoupled from logic to do the actual decoding)

}

private void readMessageHeader() {
if (buffer.remaining() < V1_HEADER_LENGTH) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then if we update the constructor to take an expected message size, we can raise this validation in the constructor similar to what Python does: https://github.com/Azure/azure-sdk-for-python/blob/b896d5c9ee202bd7c3cba51eb602562d41f4d804/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/streams.py#L466

throw new IllegalArgumentException("Unsupported structured message version: " + messageVersion);
}

messageLength = (int) buffer.getLong();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also do validation that the message length pulled from the message matches what was passed in as part of the constructor. We should be able to match Python here as well: https://github.com/Azure/azure-sdk-for-python/blob/b896d5c9ee202bd7c3cba51eb602562d41f4d804/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/streams.py#L559-L561

}
}

public byte[] decode(int size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to updating the decode to accept a byte buffer instead of a size, let's have it return a byte buffer instead of byte array. I'm mainly suggesting this because the typical downstream download logic accepts byte buffers and we'd likely need this type I suspect to make the integration smoother (i.e. all logic is only working with byte buffers)

}

public byte[] decode(int size) {
ByteArrayOutputStream decodedContent = new ByteArrayOutputStream();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more public method we are going to need is a finalize method (I'm not too tied to that name if you have a better one) that will tell the decoder that we have finished decoding and have it validate that we've consumed the entire message. I'm suggesting this because I'm not sure if the case is handled where the structured message get cut off and we only process a portion of the structured message. If we call this finalize after feeding through all byte buffers, it would then throw an error if in fact we did not get complete the decoding process.

@azure-sdk
Copy link
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure:azure-storage-common

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Storage Storage Service (Queues, Blobs, Files)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants