Overhaul wide-string support #942

Open
@trueqbit

As already mentioned in Slack, the support for wide-character strings needs a fairly complete overhaul and/or explicit documentation of its features. As far as I understand, it is broken on macOS/Linux.

  • First of all, usage of sqlite3_*_text16() and wstring_convert<codecvt_utf8_utf16<wchar_t>> is intermixed.
  • There are unit tests for only a single code path: binding to a statement and extracting from a result set via codecvt_utf8_utf16<wchar_t>. Neither conversion from a column value nor calling a function or returning from one is tested. [see test case]
  • Returning a string from a function is broken [statement_binder<>::result()]:
    • sqlite3_result_text16() expects the number of bytes, not characters. [see 3rd parameter]
    • sqlite3_result_text16() should be instructed to copy the string using SQLITE_TRANSIENT; otherwise SQLite keeps a pointer to memory that goes out of scope. [see 4th parameter]
  • Expecting UTF-16 encoded strings is correct on Windows, but not on other operating systems like macOS/Linux:
    • On Windows, everything's working fine: sizeof(wchar_t) == 2 (16-bit), and the encoding is UTF-16.
    • On macOS/Linux, sizeof(wchar_t) == 4 (32-bit):
      • Using sqlite3_*_16() functions is outright wrong.
      • Using codecvt_utf8_utf16<> is bad:
        • While it's not prohibited to use wchar_t for UTF-16, it easily leads to subtly unexpected behaviour, because a 32-bit wchar_t usually carries UTF-32, not UTF-16:
          • Passed UTF-32 strings are suddenly treated as UTF-16 by sqlite_orm/SQLite.
          • Returned UTF-16 strings are suddenly treated as UTF-32 by the program.
          • In any case, [codecvt_utf8_utf16<>](https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16) expects UTF-16, regardless of sizeof(wchar_t): "If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence." I emphasize again that this isn't the regular expectation on macOS/Linux.
    • While we are at it, I'd like to see wide-string support separated from SQLITE_ORM_OMITS_CODECVT, if possible: one might want to pass wide strings to, or return them from, Windows API functions even when unable to serialize the statement.
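
To illustrate the two sqlite3_result_text16() points above, here is a minimal sketch of what a corrected call could look like. The helper name utf16_byte_count is hypothetical and not part of sqlite_orm, and the SQLite call itself is shown only as a comment so the snippet compiles without the SQLite headers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of sqlite_orm): number of BYTES occupied by
// the code units of a UTF-16 string, excluding the terminator. Bytes, not
// characters, is the unit sqlite3_result_text16() expects as its 3rd parameter.
inline int utf16_byte_count(const std::u16string& s) {
    return static_cast<int>(s.size() * sizeof(char16_t));
}

// Intended use inside a user-defined SQL function, assuming `context` is the
// sqlite3_context* passed to that function and `s` is a local std::u16string:
//
//   sqlite3_result_text16(context, s.data(), utf16_byte_count(s), SQLITE_TRANSIENT);
//
// SQLITE_TRANSIENT asks SQLite to make its own copy of the text before the
// call returns, so `s` may be destroyed afterwards.
```

Note that a character outside the Basic Multilingual Plane occupies two UTF-16 code units, so it contributes four bytes to the count.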

One quick way of fixing the UTF-16 issue on macOS/Linux is to disable UTF-16 support altogether when not on Windows. This might not even have any impact, given that UTF-8 is prevalent on those systems.
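
As a rough sketch of that quick fix, the UTF-16 code paths could be guarded by a compile-time switch. The macro EXAMPLE_WIDE_STRING_IS_UTF16 and the constant wide_string_is_utf16 are made up for illustration; they are not existing sqlite_orm names, and the actual library might choose a different detection scheme.

```cpp
#include <string>

// Hypothetical switch (not an existing sqlite_orm macro): enable the UTF-16
// wide-string code paths only on Windows, where wchar_t is a 16-bit UTF-16
// code unit.
#if defined(_WIN32)
#define EXAMPLE_WIDE_STRING_IS_UTF16 1
#else
#define EXAMPLE_WIDE_STRING_IS_UTF16 0
#endif

// Compile-time constant that library code could branch on.
constexpr bool wide_string_is_utf16 = EXAMPLE_WIDE_STRING_IS_UTF16 != 0;

#if EXAMPLE_WIDE_STRING_IS_UTF16
// Guard against enabling the UTF-16 paths where wchar_t is not 16-bit.
static_assert(sizeof(wchar_t) == 2, "UTF-16 wide-string support requires a 16-bit wchar_t");
#endif
```

With such a switch, std::wstring bindings would simply be unavailable on macOS/Linux rather than silently misinterpreting UTF-32 data as UTF-16.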
