JSON connector doesn't retain unicode values #508

@msmygit

Description

Version used: 1.11.0

Table Schema (in DataStax Astra DB Serverless):

CREATE TABLE db3.json_load_codec (
    i text PRIMARY KEY,
    j text
);

Input Files:

JSON

[{"i":"json","j":"NO\u001a\u001aL\\"}]

CSV

i,j
csv,"NO\u001a\u001aL\\"

Scenario 1 - Fails when using JSON connector

./dsbulk load  -k db3 -t json_load_codec -b /Users/madhavan.sridharan/Documents/Data/07_downloads/secure-connect-db3.zip -c json -u token -p 'AstraCS:REDACTED' -url /Users/madhavan.sridharan/Downloads/codec_load.json --dsbulk.connector.json.mode SINGLE_DOCUMENT -verbosity normal --dsbulk.codec.binary HEX
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
A cloud secure connect bundle was provided and selected operation performs writes: changing default consistency level to LOCAL_QUORUM.
Operation directory: /Users/madhavan.sridharan/Documents/Data/05_tools/dsbulk/dsbulk-1.11.0/bin/logs/LOAD_20251021-003021-074839
Setting executor.maxPerSecond not set when connecting to DataStax Astra: applying a limit of 27,000 ops/second based on the number of coordinators (9).
If your Astra database has higher limits, please define executor.maxPerSecond explicitly.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    1 |      0 |      9 | 46.53 | 46.66 |  46.66 |    1.00
Operation LOAD_20251021-003021-074839 completed successfully in less than one second.
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=/Users/madhavan.sridharan/Documents/Data/05_tools/dsbulk/dsbulk-1.11.0/bin/logs/LOAD_20251021-003021-074839/checkpoint.csv

Scenario 2 - Success when using CSV connector

% ./dsbulk load  -k db3 -t json_load_codec -b /Users/madhavan.sridharan/Documents/Data/07_downloads/secure-connect-db3.zip -c csv -u token -p 'AstraCS:REDACTED' -url /Users/madhavan.sridharan/Downloads/codec_load.csv -verbosity normal
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
A cloud secure connect bundle was provided and selected operation performs writes: changing default consistency level to LOCAL_QUORUM.
Operation directory: /Users/madhavan.sridharan/Documents/Data/05_tools/dsbulk/dsbulk-1.11.0/bin/logs/LOAD_20251021-003008-168371
Setting executor.maxPerSecond not set when connecting to DataStax Astra: applying a limit of 27,000 ops/second based on the number of coordinators (9).
If your Astra database has higher limits, please define executor.maxPerSecond explicitly.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    1 |      0 |     10 | 26.15 | 26.21 |  26.21 |    1.00
Operation LOAD_20251021-003008-168371 completed successfully in less than one second.
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=/Users/madhavan.sridharan/Documents/Data/05_tools/dsbulk/dsbulk-1.11.0/bin/logs/LOAD_20251021-003008-168371/checkpoint.csv

Output showing the two records loaded above

token@cqlsh:db3> select * from json_load_codec ;

 i    | j
------+------------------
  csv | NO\u001a\u001aL\
 json |     NO\x1a\x1aL\

(2 rows)

Caution

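Compare the two rows: the value inserted through the JSON connector is incorrect. The \u001a sequences from the input were stored as raw control characters (displayed by cqlsh as \x1a), whereas the CSV connector kept them as the literal text \u001a.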
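
For anyone triaging this, here is a minimal sketch of how a standard JSON parser treats the same string, using Python's stdlib json module rather than dsbulk itself. It assumes, without having verified it, that the JSON connector decodes string escapes the way the JSON grammar describes; the variable names are illustrative only. The last part shows a possible workaround to try: doubling the backslash in the input so the sequence is read as literal text.

```python
import json

# The "j" value exactly as written in codec_load.json (raw string keeps backslashes).
json_token = r'"NO\u001a\u001aL\\"'

# A standard JSON parser turns \u001a into the single control character U+001A
# and \\ into one backslash.
decoded = json.loads(json_token)

# The text stored for the CSV row, per the cqlsh output: \u001a kept as six
# literal characters, followed by one trailing backslash.
literal = r'NO\u001a\u001aL' + '\\'

print(repr(decoded))
print(repr(literal))
print(decoded == literal)

# Possible workaround (an assumption, not confirmed against dsbulk): escape the
# backslash itself so the parser yields the literal text instead.
escaped_token = r'"NO\\u001a\\u001aL\\"'
workaround = json.loads(escaped_token)
print(workaround == literal)
```

If the JSON connector follows the same rules, the difference between the two rows comes from escape decoding at parse time rather than from the text codec, which may help narrow down where to look.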
