Skip to content

Initial support for Big Endian #75

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

SammyVimes
Copy link

@SammyVimes SammyVimes commented Jan 29, 2024

Hi! Noticed that there is no support for Big Endian, so decided to start working on it.
Currently I did:

  • sz_find_1char_swar
  • sz_rfind_1char_swar
  • sz_find_2char_swar
  • sz_find_3char_swar
  • sz_find_4char_swar
  • sz_find_substring_swar

(also fixed a little bug in the little endian version of sz_rfind_1char_swar).

I also took a liberty of adding googletest dependency (of course I can remove it, I just prefer using it)
and supporting linux version of qsort_r in the test.cpp.

Although I am using macOS, I was able to test on a big endian machine using QEMU and docker. So I think the next thing to do for this pull request will be adding a workflow with the similar setup (or maybe GitHub has BE machines, I don't know yet).

With all those if (IS_LITTLE_ENDIAN) code looks funky, I know. I will try to do something with it

@ashvardanian
Copy link
Owner

Hi, @SammyVimes! Thank you for the PR! Big Endian support is a great thing to have, but we need to make a few changes to proceed with the PR.

  1. The main-dev branch is miles ahead, and is being prepared for a major release. Please use it as a reference branch.
  2. It is probably wiser to use macros for big/little endian checks.
  3. Let's avoid Google Test and other third-party utilities. They are very useful in the general case, but a careful use of assert-s makes more sense here, allowing us to conditionally log more info about the scope, hence simplifying debugging.

@SammyVimes
Copy link
Author

Hi, @ashvardanian! Sure, will change the PR accordingly. I really don’t know how I managed to miss the main-dev branch 😅

@ashvardanian
Copy link
Owner

Hi @SammyVimes! Any chance you've made any progress on the new version? I'm installing Docker QEMU images now, to test on 32-bit and big-endian architectures to generalize StringZilla further. Can continue your efforts, if you have anything you can push 🤗

@SammyVimes
Copy link
Author

@ashvardanian sorry, I was completely swamped by work. If big endian support is still required, I will happily pick up where I left off

@@ -35,6 +35,8 @@
#define NULL ((void *)0)
#endif

#define IS_LITTLE_ENDIAN (*(uint16_t *)"\0\xff" > 0x100)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The endianness detection macro uses unaligned pointer access which is undefined behavior in C and can cause crashes on architectures that don't support unaligned memory access. String literals may be placed in read-only memory which could cause additional issues. A safer approach using unions or byte-by-byte comparison should be used.

📚 Relevant Docs


React with 👍 to tell me that this comment was useful, or 👎 if not (and I'll stop posting more comments like this in the future)

sz_string_start_t const end = haystack + haystack_length;
sz_string_start_t text = end - 1;
sz_string_start_t text = end;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In sz_rfind_1char_swar, initializing 'text' with 'end' and then immediately using it in a decrement operation (--text) while checking alignment can lead to undefined behavior if 'end' points to the start of the buffer, causing potential buffer underflow. The alignment check should be done after ensuring text > haystack.


React with 👍 to tell me that this comment was useful, or 👎 if not (and I'll stop posting more comments like this in the future)

Copy link

recurseml bot commented May 2, 2025

😱 Found 2 issues. Time to roll up your sleeves! 😱

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants