When ingesting VECTOR<FLOAT,n> data from a JSON file, dsbulk (v1.11) fails for "floats" that are written with too many digits. These values seem to be treated as doubles, which then appears to cause an unrecoverable error.
Notes:
- JSON files produced by dsbulk itself are OK, i.e. their floats are proper floats (few digits).
- However, for people loading datasets generated elsewhere (e.g. with Python, which has no clear float/double distinction), this limitation might get in the way.
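To illustrate the second note, here is a small sketch of how a long-digit value can come out of Python even when the data started as 32-bit floats. It is only an illustration of one plausible origin for such files, not a statement about how any particular dataset was produced; the helper name `to_float32` is made up for this example.

```python
import json
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float, returned as a Python float (a double)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Python has a single float type (a 64-bit double), so a value that was
# stored as a 32-bit float is widened and serialised with double precision.
stored = to_float32(6.64632)

print(json.dumps(6.64632))  # the short literal, as in good.json
print(json.dumps(stored))   # the same float32 value, written with many more digits
```

The second printed value still fits exactly in a 32-bit float; only its decimal spelling got longer.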
Minimal reproducible case:
create table mini_table (id text primary key, embedding vector<float, 2>);
java -jar dsbulk-1.11.0.jar load -k $KEYSPACE -t mini_table -u "token" -p $TOKEN -b $BUNDLEZIP --dsbulk.connector.json.mode SINGLE_DOCUMENT --connector.json.url GOOD_OR_BAD.json -c json
$> cat good.json
[
  {
    "id": "my_row",
    "embedding": [
      6.64632,
      4.49715
    ]
  }
]
$> cat bad.json
[
  {
    "id": "my_row",
    "embedding": [
      6.646329843,
      4.4971533213
    ]
  }
]
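A possible client-side workaround, sketched in Python and not tested against dsbulk: rewrite each vector component as the shortest decimal that still maps to the same 32-bit float before loading. This assumes the failure is triggered by values carrying more digits than a 32-bit float needs, which is a guess based on the two files above. The function names and the `field` parameter are hypothetical.

```python
import json
import struct

MAX_FLOAT32_DIGITS = 9  # 9 significant digits always round-trip a 32-bit float


def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def shorten(x: float) -> float:
    """Return the shortest decimal value that rounds to the same 32-bit float as x."""
    target = to_float32(x)
    for digits in range(1, MAX_FLOAT32_DIGITS + 1):
        candidate = float(f"{target:.{digits}g}")
        if to_float32(candidate) == target:
            return candidate
    return target  # fallback, e.g. for NaN, which never compares equal


def shorten_embeddings(rows, field="embedding"):
    """Return a copy of rows with every component of `field` shortened."""
    return [{**row, field: [shorten(v) for v in row[field]]} for row in rows]


# Same content as bad.json above, kept inline so the sketch is self-contained.
bad = json.loads('[{"id": "my_row", "embedding": [6.646329843, 4.4971533213]}]')
print(json.dumps(shorten_embeddings(bad), indent=2))
```

The rewritten values lose nothing that a vector<float, n> column could store, since each one rounds to the same 32-bit float as the original. Whether dsbulk then accepts them would still need to be checked.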