Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot read bytes from String or FixedString(N) #282

Open
c0mm4nd opened this issue Jun 10, 2024 · 4 comments
Open

Cannot read bytes from String or FixedString(N) #282

c0mm4nd opened this issue Jun 10, 2024 · 4 comments
Labels
question Further information is requested

Comments

@c0mm4nd
Copy link

c0mm4nd commented Jun 10, 2024

Describe the bug

Directly select a String or FixedString(N) column which stores raw bytes rather than UTF-8 string, the result will contain

e.g.
image

the correct value of this should be the hash unhex('000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b')

Steps to reproduce

  1. Store raw hash or any other raw bytes in Clickhouse DB
  2. Directly select it through clickhouse-js, formatting as any CSV or JSON
  3. convert the string to bytes
                                    function stringToBytes(str) {
                                      const bytes = [];
                                      for (let i = 0; i < str.length; i++) {
                                        const codeUnit = str.charCodeAt(i);
                                        bytes.push(codeUnit);
                                      }
                                      return bytes;
                                    }
  1. the value is incorrect and ruined by 65533

Expected behaviour

return a correct string or bytes

Code example

as shown in Steps to reproduce

Error log

Configuration

Environment

  • Client version: latest, 1.1.0
  • Language version:
  • OS: Chrome

ClickHouse server

  • ClickHouse Server version: clickhouse/clickhouse-server:23.12.4-alpine
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
  • Sample data for all these tables, use clickhouse-obfuscator if necessary
@c0mm4nd c0mm4nd added the bug Something isn't working label Jun 10, 2024
@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 10, 2024

This is because the client's ResultSet text and JSON decoders assume a UTF-8 string inside that field (but it is not a UTF-8 string in this specific case).

For now, a passable workaround could be to just convert it back to bytes:

import { createClient } from '@clickhouse/client' // or '@clickhouse/client-web'

void (async () => {
  const client = createClient()
  const resultSet = await client.query({
    query: `SELECT unhex('000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b') AS x`,
    format: 'JSONEachRow',
  })
  const result = await resultSet.json<{ x: string }>()
  console.info('Result:', new TextEncoder().encode(result[0].x))
  await client.close()
})()
Result: Uint8Array(58) [
    0,   0,   0, 112, 239, 191, 189,  80, 239, 191,
  189,  64, 239, 191, 189,  41, 126, 239, 191, 189,
  239, 191, 189,  33, 239, 191, 189, 239, 191, 189,
   99,  29, 239, 191, 189, 239, 191, 189, 239, 191,
  189,  90,  52,  98, 239, 191, 189, 239, 191, 189,
  239, 191, 189,  26, 121, 114, 111, 107
]

@slvrtrn slvrtrn added question Further information is requested and removed bug Something isn't working labels Jun 10, 2024
@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 11, 2024

Another solution (which looks like a more correct one in this case, cause the previous result set looks a bit off) is just to ask ClickHouse to return a byte array instead so that conversion happens on the server side:

WITH '000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b' AS str
SELECT arrayMap(x -> reinterpretAsUInt8(x), splitByString('', unhex(str))) AS bytes

Result:

   ┌─bytes─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ [0,0,0,112,133,80,243,64,161,41,126,239,231,33,163,176,99,29,141,196,204,90,52,98,171,174,239,26,121,114,111,107] │
   └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

And that will be decoded with JSON formats as just an array of numbers, which can be then used to create a UInt8Array in Node.

@c0mm4nd, can you please check if it works in your scenario? Then I can add an example and close this.

@c0mm4nd
Copy link
Author

c0mm4nd commented Jun 13, 2024

In the first passable solution, the hash value length is 58 rather than 32, due to the invalid unicode 65533 has already ruined the result. I cannot convert the 65533 back to the correct byte.
The second works, but it requires rewriting the SQL, and the bytes-parse work is not efficient when handling tons of hash results.
So these two are not what I want.

Python client clickhouse-connect also works on HTTP port, but it doesn't have such a problem.

@slvrtrn
Copy link
Contributor

slvrtrn commented Jun 13, 2024

Another option is to return the fixed string encoded as base64, which will require rewriting the queries a bit.

For example:

import { createClient } from '@clickhouse/client'

void (async () => {
  const client = createClient()
  const resultSet = await client.query({
    query: `SELECT base64Encode(unhex('000000708550f340a1297eefe721a3b0631d8dc4cc5a3462abaeef1a79726f6b')) AS x`,
    format: 'JSONEachRow',
  })
  const result = await resultSet.json<{ x: string }>()
  const buffer = Buffer.from(result[0].x, 'base64')
  console.info('Base64:', result[0].x)
  console.info('Buffer:', buffer, 'with length:', buffer.length)
  await client.close()
})()
Base64: AAAAcIVQ80ChKX7v5yGjsGMdjcTMWjRiq67vGnlyb2s=
Buffer: <Buffer 00 00 00 70 85 50 f3 40 a1 29 7e ef e7 21 a3 b0 63 1d 8d c4 cc 5a 34 62 ab ae ef 1a 79 72 6f 6b> with length: 32

Python client clickhouse-connect also works on HTTP port, but it doesn't have such a problem.

I'd assume it works there because it uses RowBinary (so that custom type and column mappers can be provided), which has not yet been implemented in this client. See this issue for tracking; I had to put it on hold due to other duties, as this feature is massive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants