Unsound Use of str::from_utf8_unchecked in keyword_token #1859

lwz23 · 2024-11-30T13:30:00Z

Description
The keyword_token function uses unsafe { str::from_utf8_unchecked(word) } to convert a byte slice (&[u8]) into a string slice (&str) without validating whether the input is valid UTF-8. This introduces undefined behavior (UB) if the word parameter contains invalid UTF-8 bytes. The absence of validation makes the function unsound.

libsql/vendored/sqlite3-parser/src/dialect/mod.rs

Line 69 in 9241b00

.get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))

pub fn keyword_token(word: &[u8]) -> Option<TokenType> {
    KEYWORDS
        .get(UncasedStr::new(unsafe { str::from_utf8_unchecked(word) }))
        .cloned()
}

Problems:
This function is pub, so I assume a caller can control the word argument, which causes several problems.

Undefined Behavior on Invalid UTF-8:
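unsafe { str::from_utf8_unchecked(word) } assumes that the word slice is valid UTF-8. If this assumption is violated, the safety contract of from_utf8_unchecked is broken, and any later operation on the resulting &str that relies on UTF-8 validity can trigger undefined behavior.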
The function does not verify that word is valid UTF-8 before invoking the unsafe conversion.
No Safety Contract:
The function is not marked as unsafe, nor does it document the requirement that the word input must be valid UTF-8. This makes it easy for callers to misuse the function by passing invalid inputs.
Potential Exploitation:
If word is derived from untrusted or external input, it could contain invalid UTF-8. This could lead to crashes, memory corruption, or other unpredictable behavior.
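
To make the first point concrete, here is a minimal sketch showing that entirely safe code can build a byte slice that is not valid UTF-8, and that the checked std::str::from_utf8 rejects it. The names is_valid_keyword_input and INVALID_WORD are hypothetical and exist only for illustration; they are not part of the crate.

```rust
use std::str;

/// Hypothetical helper: reports whether `word` could soundly be viewed as a `&str`.
pub fn is_valid_keyword_input(word: &[u8]) -> bool {
    // `str::from_utf8` validates the bytes and returns `Err` on invalid UTF-8.
    str::from_utf8(word).is_ok()
}

/// Bytes that a safe caller could hand to any `pub fn(&[u8])`.
/// The byte 0xFF never appears in well-formed UTF-8.
pub const INVALID_WORD: &[u8] = &[b'S', b'E', 0xFF, b'L'];
```

Passing something like INVALID_WORD to the current keyword_token requires no unsafe code on the caller's side, which is why the function is considered unsound.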

Suggestion

Mark this function as unsafe and document its safety requirements.
Add a check in the function body, e.g. use from_utf8 instead of from_utf8_unchecked.
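
The two suggestions could look roughly like the sketch below. It is not the crate's actual code: TokenType, the KEYWORDS table, and the lookup helper are simplified, hypothetical stand-ins (the real KEYWORDS map is keyed by UncasedStr), and keyword_token_unchecked is an invented name for the unsafe variant.

```rust
use std::str;

/// Hypothetical stand-in for the parser's `TokenType`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TokenType {
    Select,
    From,
    Where,
}

/// Hypothetical stand-in for the real `KEYWORDS` map.
const KEYWORDS: &[(&str, TokenType)] = &[
    ("SELECT", TokenType::Select),
    ("FROM", TokenType::From),
    ("WHERE", TokenType::Where),
];

/// Case-insensitive (ASCII) keyword lookup over the stand-in table.
fn lookup(word: &str) -> Option<TokenType> {
    KEYWORDS
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(word))
        .map(|(_, t)| *t)
}

/// Safe variant: validate first; invalid UTF-8 is simply "not a keyword".
pub fn keyword_token(word: &[u8]) -> Option<TokenType> {
    let word = str::from_utf8(word).ok()?;
    lookup(word)
}

/// Unsafe variant: skips validation and pushes the obligation to the caller.
///
/// # Safety
///
/// `word` must be valid UTF-8.
pub unsafe fn keyword_token_unchecked(word: &[u8]) -> Option<TokenType> {
    // SAFETY: the caller guarantees that `word` is valid UTF-8.
    lookup(unsafe { str::from_utf8_unchecked(word) })
}
```

The safe variant pays for one validation pass per call. If that cost matters on a hot tokenizer path, the unsafe variant keeps the current speed while making the UTF-8 requirement explicit at every call site.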

Additional Context:
Unsafe code should only be used when safety invariants are strictly guaranteed. The current implementation assumes that the word input is always valid UTF-8, but this is not enforced or documented, making the function unsound. By switching to std::str::from_utf8, the function can remain safe and robust while handling invalid input gracefully.


penberg · 2025-06-06T10:30:04Z

Let's keep this issue open, but it's something that ought to be fixed in the upstream repo first.

lwz23 closed this as completed Dec 2, 2024

lwz23 reopened this May 28, 2025

kmr-ankitt mentioned this issue Jun 5, 2025

fix: make keyword_token safe by validating UTF-8 input #2093

Closed

kmr-ankitt mentioned this issue Jun 6, 2025

fix: make keyword_token safe by validating UTF-8 input tursodatabase/limbo#1677

Merged

pereman2 closed this as completed in tursodatabase/limbo@8cd7c7e Jun 9, 2025
