Skip to content

Unsound Use of str::from_utf8_unchecked in keyword_token #1859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lwz23 opened this issue Nov 30, 2024 · 1 comment · Fixed by tursodatabase/limbo#1677
Closed

Unsound Use of str::from_utf8_unchecked in keyword_token #1859

lwz23 opened this issue Nov 30, 2024 · 1 comment · Fixed by tursodatabase/limbo#1677

Comments

@lwz23
Copy link

lwz23 commented Nov 30, 2024

Description
The keyword_token function uses unsafe { str::from_utf8_unchecked(word) } to convert a byte slice (&[u8]) into a string slice (&str) without validating whether the input is valid UTF-8. This introduces undefined behavior (UB) if the word parameter contains invalid UTF-8 bytes. The absence of validation makes the function unsound.

.get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))

pub fn keyword_token(word: &[u8]) -> Option<TokenType> {
    KEYWORDS
        .get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))
        .cloned()
}

Problems:
this function is a pub function, so I assume user can control the word field, it cause some problems.

  1. Undefined Behavior on Invalid UTF-8:
    unsafe { str::from_utf8_unchecked(word) } assumes that the word slice is valid UTF-8. If this assumption is violated, undefined behavior occurs immediately.
    The function does not verify that word is valid UTF-8 before invoking the unsafe conversion.
  2. No Safety Contract:
    The function is not marked as unsafe, nor does it document the requirement that the word input must be valid UTF-8. This makes it easy for callers to misuse the function by passing invalid inputs.
  3. Potential Exploitation:
    If word is derived from untrusted or external input, it could contain invalid UTF-8. This could lead to crashes, memory corruption, or other unpredictable behavior.

Suggestion

  1. mark this function as unsafe and provide safety doc.
  2. add some check in the function body eg. use from_utf8 instead.

Additional Context:
Unsafe code should only be used when safety invariants are strictly guaranteed. The current implementation assumes that the word input is always valid UTF-8, but this is not enforced or documented, making the function unsound. By switching to std::str::from_utf8, the function can remain safe and robust while handling invalid input gracefully.

@penberg
Copy link
Collaborator

penberg commented Jun 6, 2025

Let's keep this issue open, but it's something that ought to be fixed in the upstream repo first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants