Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43605: [Go][Parquet] unpack32Avx2 returns 0 if its buffer is empty #43616

Closed

Conversation

don4get
Copy link

@don4get don4get commented Aug 8, 2024

Rationale for this change

In case a uint16 column of a parquet file is fully filled with 0, it may happen in some cases that arrow fails to open the file.
Issue is described in GH-43605 and related PR: #43607

I created a new PR to be able to differenciate the 2 changes:

What changes are included in this PR?

In case the buffer of unpack32Avx2 is empty, this function returns 0.

Are these changes tested?

Yes, I need this branch to be merged to be able to run the test in your CI/CD:
apache/parquet-testing#57

Are there any user-facing changes?

I would say the API is not broken and no changes are expected from users, plus unit tests are passing on my setup with ./ci/scripts/go_test.sh. I would like to see the results of the jobs in your GH workflows.

This PR contains a "Critical Fix".

Comment on lines 48 to 50
if buffer.Len() == 0 {
return 0
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to shift this above line 42 and have it be:

if n == 0 {
    return 0
}

instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the dataset I have it makes sense.
I don't have enough experience with arrow to know if n!=0 && buffer.Len()==0 is possible. In this case, the program would crash.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the way the code is written, buffer.Len() should always end up being equal to n because we call buffer.Reset() and buffer.Grow(n) so I think we should make this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right. I moved the condition directly on nbits. Since batch is already checked, it covers all cases where n is equal to 0. Is it OK for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we're dealing with integer division, that doesn't cover all cases. To cover all cases you'd need something like if batch * nbits < 8 since integer division would result in n == 0 for any scenario where batch * nbits is less than 8 (since that would result in 0 when we divide by 8). So it might just be better to just check if n == 0 rather than try to find all the cases

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I moved the condition on n.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 8, 2024
@don4get don4get force-pushed the unpack32avx2-return-0-if-empty-buf branch from 4d97086 to 7ba9479 Compare August 9, 2024 08:43
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 9, 2024
@don4get don4get force-pushed the unpack32avx2-return-0-if-empty-buf branch from 7ba9479 to 42c5e45 Compare August 9, 2024 21:43
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 9, 2024
@don4get don4get force-pushed the unpack32avx2-return-0-if-empty-buf branch from 42c5e45 to 1af4ee9 Compare August 9, 2024 21:44
@don4get don4get force-pushed the unpack32avx2-return-0-if-empty-buf branch from 1af4ee9 to 7ef5083 Compare August 26, 2024 08:32
@don4get don4get force-pushed the unpack32avx2-return-0-if-empty-buf branch from 7ef5083 to 13bcd8d Compare August 26, 2024 08:41
@don4get don4get force-pushed the unpack32avx2-return-0-if-empty-buf branch from 13bcd8d to 07de915 Compare August 26, 2024 08:44
@don4get don4get closed this Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants