Parsing vector data from JSON fails for "floats" with too many digits (aka doubles) #484

@hemidactylus

Description

When ingesting VECTOR<FLOAT,n> data from a JSON file, dsbulk (v1.11) fails on vector components that are written with more digits than a float can hold. Such values appear to be treated as doubles, and the load then fails with no apparent way to recover.

Notes:

  1. JSON files produced by dsbulk itself are OK: their floats are proper floats (few digits).
  2. However, people loading datasets generated elsewhere (e.g. with Python, which lacks a clear float/double distinction) may run into this limitation.
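To illustrate note 2, here is a minimal sketch of how a value that started life as a 32-bit float picks up extra digits once it passes through Python. The helper `to_float32` is a hypothetical name introduced for this example, and the scenario (an embedding stored as float32 upstream, then serialized with the standard `json` module) is an assumption about how such datasets are typically produced, not something taken from a specific dataset.

```python
import json
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float, returned as a Python float (a 64-bit double)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# A component that is stored as a 32-bit float somewhere upstream...
stored = to_float32(6.64632)

# ...is widened to a double as soon as it becomes a Python float, and
# json.dumps writes the shortest decimal that identifies that double,
# which needs more digits than the 6-digit literal we started from.
serialized = json.dumps({"id": "my_row", "embedding": [stored]})
print(serialized)
```

The printed component carries the same information as the original float, but it is spelled with double-precision digits, which is the shape of input that bad.json represents.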

Minimal reproducible case

create table mini_table (id text primary key, embedding vector<float, 2>);
java -jar dsbulk-1.11.0.jar load -k $KEYSPACE -t mini_table -u "token" -p $TOKEN -b $BUNDLEZIP --dsbulk.connector.json.mode SINGLE_DOCUMENT --connector.json.url GOOD_OR_BAD.json -c json
$> cat good.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.64632,
   4.49715
  ]
 }
]

$> cat bad.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.646329843,
   4.4971533213
  ]
 }
]
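As a possible stopgap on the data side, the components can be rewritten before loading so that each one is spelled with the fewest digits that still identify the same 32-bit float. The sketch below is untested against dsbulk: it assumes, without confirmation from the dsbulk source, that the failure is triggered by the extra digits rather than by something else. The names `to_float32`, `shortest_float32_decimal`, and `trim_embeddings` are hypothetical helpers, and values are assumed to be finite and within float32 range.

```python
import json
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def shortest_float32_decimal(x: float) -> float:
    """Return the shortest decimal (as a Python float) that maps to the same float32 as x."""
    target = to_float32(x)
    for digits in range(1, 10):  # 9 significant digits always identify a float32
        candidate = float(f"{target:.{digits}g}")
        if to_float32(candidate) == target:
            return candidate
    return target  # only reached for NaN, which never compares equal


def trim_embeddings(rows, field="embedding"):
    """Return a copy of rows with every component of `field` trimmed to float32 digits."""
    return [
        {**row, field: [shortest_float32_decimal(v) for v in row[field]]}
        for row in rows
    ]


# The content of bad.json, inlined so the example is self-contained.
bad = json.loads('[{"id": "my_row", "embedding": [6.646329843, 4.4971533213]}]')
print(json.dumps(trim_embeddings(bad), indent=1))
```

Values that already have few digits, such as those in good.json, pass through unchanged, so the same step can be applied to any input file. This does change the stored values slightly for inputs like bad.json, which is unavoidable when the target column is vector<float, n>.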
