chore(ci): Ensure integration workflow passes #643

paleolimbot · 2024-10-02T20:50:15Z

Closes #641.

Unfortunately, we have to skip checking Rust compatibility because of apache/arrow-rs#5052 (e.g., apache/arrow-rs#6449).

This PR also ensures compatibility with big endian Arrow files and with Arrow files written before the continuation token was introduced. Support for both had already been added to the decoder but hadn't made it to the stream reader yet.
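For readers unfamiliar with the two framings, here is a minimal sketch of how a reader might tell them apart. It assumes the Arrow IPC convention that a message starts either with the 0xFFFFFFFF continuation token followed by a little-endian int32 length, or (in legacy streams) with the length alone. `PrefixInfo`, `ReadInt32LE`, and `PeekPrefix` are hypothetical names for illustration and are not nanoarrow functions.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical result type for this sketch; not part of nanoarrow's API.
struct PrefixInfo {
  int32_t prefix_size_bytes;  // 8 with a continuation token, 4 for legacy framing
  int32_t header_size_bytes;  // size of the flatbuffer message that follows
};

// Decode a little-endian int32 regardless of the host byte order.
static int32_t ReadInt32LE(const uint8_t* data) {
  uint32_t value = (uint32_t)data[0] | ((uint32_t)data[1] << 8) |
                   ((uint32_t)data[2] << 16) | ((uint32_t)data[3] << 24);
  return (int32_t)value;
}

// Returns 0 on success or -1 if more bytes are needed to read the prefix.
static int PeekPrefix(const uint8_t* data, size_t size_bytes,
                      struct PrefixInfo* out) {
  if (size_bytes < 4) return -1;

  int32_t first = ReadInt32LE(data);
  if (first == -1) {
    // 0xFFFFFFFF continuation token: the length is in the next four bytes.
    if (size_bytes < 8) return -1;
    out->prefix_size_bytes = 8;
    out->header_size_bytes = ReadInt32LE(data + 4);
  } else {
    // Legacy framing (assumed): the first four bytes are the length itself.
    out->prefix_size_bytes = 4;
    out->header_size_bytes = first;
  }

  return 0;
}
```

Decoding the length byte by byte keeps the sketch independent of the host's endianness, which matters when the same reader runs on a big endian machine.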

Local check:

# Assumes arrow-testing, arrow-nanoarrow, and arrow are all checked out in the same dir
export gold_dir=../arrow-testing/data/arrow-ipc-stream/integration 
export ARROW_NANOARROW_PATH=$(pwd)/build 
pip install -e "../arrow/dev/archery/[all]"
archery integration --with-nanoarrow=true --run-ipc \
    --gold-dirs=$gold_dir/0.14.1 \
    --gold-dirs=$gold_dir/0.17.1 \
    --gold-dirs=$gold_dir/1.0.0-bigendian \
    --gold-dirs=$gold_dir/1.0.0-littleendian \
    --gold-dirs=$gold_dir/2.0.0-compression \
    --gold-dirs=$gold_dir/4.0.0-shareddict

paleolimbot · 2024-10-02T20:59:15Z

We still have a failure related to flatbuffer alignment for integration/0.14.1, which I believe is a version of the format that didn't have the continuation token. We probably have to align the message like C++ does (since a message could also arrive in other ways that would leave it unaligned), and there may also be a buffer size calculation error.

# archery integration --with-nanoarrow=true --run-ipc --gold-dirs=$gold_dir/0.14.1 --debug
======================================================================
Command STREAM_TO_FILE failed (22=Invalid argument): Message flatbuffer verification failed (12) table field not aligned
Command STREAM_TO_FILE failed (22=Invalid argument): Expected at least 8 bytes in remainder of stream
======================================================================
Testing file /var/folders/gt/l87wjg8s7312zs9s7c1fgs900000gn/T/tmpmhx_m5qe/0.14.1_datetime.gold.json
-- Validating file
/Users/deweydunnington/Desktop/rscratch/arrow-nanoarrow/build/nanoarrow_ipc_integration {'ARROW_PATH': '../arrow-testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.arrow_file', 'JSON_PATH': '/var/folders/gt/l87wjg8s7312zs9s7c1fgs900000gn/T/tmpmhx_m5qe/0.14.1_datetime.gold.json', 'COMMAND': 'VALIDATE', 'QUIRK_no_date64_validate': '1', 'QUIRK_no_decimal_validate': '1', 'QUIRK_no_times_validate': '1'}
-- Validating stream
/Users/deweydunnington/Desktop/rscratch/arrow-nanoarrow/build/nanoarrow_ipc_integration < ../arrow-testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream
Traceback (most recent call last):
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/runner.py", line 307, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/runner.py", line 129, in run_gold
    return self._run_gold(gold_dir, consumer, test_case)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/runner.py", line 385, in _run_gold
    consumer.stream_to_file(consumer_stream_path, consumer_file_path)
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/tester_nanoarrow.py", line 76, in stream_to_file
    self.run_shell_command([_INTEGRATION_EXE, '<', stream_path], env={
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/tester.py", line 233, in run_shell_command
    subprocess.check_call(cmd, **kwargs)
  File "/opt/homebrew/Cellar/[email protected]/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/Users/deweydunnington/Desktop/rscratch/arrow-nanoarrow/build/nanoarrow_ipc_integration < ../arrow-testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream' returned non-zero exit status 22.
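
One way to address the "table field not aligned" failure above is to copy the message into aligned storage before verifying it. The following is a minimal sketch under the assumption that 8-byte alignment is sufficient for the flatbuffer verifier; `IsAligned8` and `EnsureAligned8` are hypothetical helpers, not nanoarrow or Arrow C++ functions, and this is not necessarily how either library handles it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Returns nonzero if ptr is aligned to an 8-byte boundary.
static int IsAligned8(const void* ptr) { return ((uintptr_t)ptr % 8) == 0; }

// Returns a pointer to an 8-byte aligned view of the message, or NULL if
// allocation fails. If a copy was needed, *owned is set to the allocation
// that the caller must free(); otherwise *owned is NULL.
static const uint8_t* EnsureAligned8(const uint8_t* data, size_t size_bytes,
                                     void** owned) {
  *owned = NULL;
  if (IsAligned8(data)) return data;

  // Assumption: malloc() returns storage aligned to at least 8 bytes, which
  // holds on common platforms. Allocate at least one byte so that a
  // zero-length message does not yield an implementation-defined NULL.
  void* copy = malloc(size_bytes > 0 ? size_bytes : 1);
  if (copy == NULL) return NULL;

  memcpy(copy, data, size_bytes);
  *owned = copy;
  return (const uint8_t*)copy;
}
```

The copy is only made when the input is unaligned, so streams that already use the 8-byte prefix pay no extra cost.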

We also have big-endian failures related to endian swapping.

Testing file /var/folders/gt/l87wjg8s7312zs9s7c1fgs900000gn/T/tmpq4m_94qy/1.0.0-bigendian_decimal256.gold.json
-- Validating file
/Users/deweydunnington/Desktop/rscratch/arrow-nanoarrow/build/nanoarrow_ipc_integration {'ARROW_PATH': '../arrow-testing/data/arrow-ipc-stream/integration/1.0.0-bigendian/generated_decimal256.arrow_file', 'JSON_PATH': '/var/folders/gt/l87wjg8s7312zs9s7c1fgs900000gn/T/tmpq4m_94qy/1.0.0-bigendian_decimal256.gold.json', 'COMMAND': 'VALIDATE', 'QUIRK_no_times_validate': '1', 'QUIRK_no_decimal_validate': '1', 'QUIRK_no_date64_validate': '1'}
Traceback (most recent call last):
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/util.py", line 136, in run_cmd
    output = subprocess.check_output(cmd, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/Users/deweydunnington/Desktop/rscratch/arrow-nanoarrow/build/nanoarrow_ipc_integration']' returned non-zero exit status 22.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/runner.py", line 307, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/runner.py", line 129, in run_gold
    return self._run_gold(gold_dir, consumer, test_case)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/runner.py", line 373, in _run_gold
    consumer.validate(json_path, producer_file_path,
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/tester_nanoarrow.py", line 70, in validate
    return self._run(arrow_path, json_path, 'VALIDATE', quirks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/tester_nanoarrow.py", line 67, in _run
    run_cmd([_INTEGRATION_EXE], env=env)
  File "/Users/deweydunnington/Desktop/rscratch/arrow/dev/archery/archery/integration/util.py", line 145, in run_cmd
    raise RuntimeError(sio.getvalue())
RuntimeError: Command failed: /Users/deweydunnington/Desktop/rscratch/arrow-nanoarrow/build/nanoarrow_ipc_integration
With output:
--------------
Validating that ../arrow-testing/data/arrow-ipc-stream/integration/1.0.0-bigendian/generated_decimal256.arrow_file reads identical to /var/folders/gt/l87wjg8s7312zs9s7c1fgs900000gn/T/tmpq4m_94qy/1.0.0-bigendian_decimal256.gold.json
Command VALIDATE failed (22=Invalid argument): Found 1089 differences between batches:
Path: Batch 0.children[0]
- {"name": "f0", "count": 7, "VALIDITY": [0, 1, 1, 1, 1, 0, 1], "DATA": ["0", "-37078337769718894962251108817035780212840189605462990638900398314494535663617", "44012908020322448267609753248962852605928833126326122667726357114593162035199", "-54009362295389357650022329187996731063181254861555557649287163238231163535361", "-12982109917889636080410433322026198813372849684931462221804907743398852558849", "0", "37865473646572296991777977983078515742704918403723945086956043389824762118143"]}
+ {"name": "f0", "count": 7, "VALIDITY": [0, 1, 1, 1, 1, 0, 1], "DATA": ["0", "-41942369422925886428931794240078936402", "-114321348277747672141566107601624871327", "-7473422529517336432944396399317313656", "-31083044846781926940152882137004880669", "0", "-140830396400176073233117491079825213613"]}

Path: Batch 0.children[1]
- {"name": "f1", "count": 7, "VALIDITY": [0, 0, 0, 0, 1, 0, 0], "DATA": ["0", "0", "0", "0", "-1620272279296278412426461366559779130214349

--------------
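
The thread does not pin down the cause of the decimal256 mismatch above. As a point of reference only, a correct endian swap for a fixed-width value has to reverse all of its bytes (32 for decimal256), not just a narrower prefix. The sketch below shows that operation; `SwapBytesInPlace` and `SwapBuffer` are hypothetical names and do not describe the actual nanoarrow implementation or the actual bug.

```c
#include <stddef.h>
#include <stdint.h>

// Reverse the bytes of one fixed-width value in place.
static void SwapBytesInPlace(uint8_t* value, size_t width_bytes) {
  for (size_t i = 0; i < width_bytes / 2; i++) {
    uint8_t tmp = value[i];
    value[i] = value[width_bytes - 1 - i];
    value[width_bytes - 1 - i] = tmp;
  }
}

// Swap every element of a contiguous buffer of fixed-width values.
// For decimal256, width_bytes is 32; for decimal128 it is 16.
static void SwapBuffer(uint8_t* data, size_t n_values, size_t width_bytes) {
  for (size_t i = 0; i < n_values; i++) {
    SwapBytesInPlace(data + i * width_bytes, width_bytes);
  }
}
```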

…compressed/dictionary-encoded files (#44298)

### Rationale for this change

There are a few remaining failures when testing nanoarrow against itself: apache/arrow-nanoarrow#643. Our IPC reader doesn't support dictionaries or compression, so we can't run those tests.

### What changes are included in this PR?

Skips were added to the archery code that runs the tests.

### Are these changes tested?

Yes (integration tests run on every commit)

### Are there any user-facing changes?

No!

* GitHub Issue: #44297

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>

paleolimbot · 2024-10-04T14:12:28Z

@bkietz Is there any chance you have the bandwidth to make sure I didn't miss anything here?

bkietz · 2024-10-04T15:19:22Z

src/nanoarrow/ipc/reader.c

@@ -230,6 +230,20 @@ static int ArrowIpcArrayStreamReaderNextHeader(
    // propagated higher (e.g., if the stream is empty and there's no schema message)
    ArrowErrorSet(&private_data->error, "No data available on stream");
    return ENODATA;
+  } else if (bytes_read == 4) {


I think this special case should specifically assert the metadata version we're currently reading. It's never valid for a V5 message to omit the continuation and a V4 message is invalid if it includes that.

And we can't write:

Suggested change

} else if (bytes_read == 4) {

// DON'T commit; broken!

} else if (private_data->decoder.metadata_version == NANOARROW_IPC_METADATA_VERSION_V4 && bytes_read == 4) {

because the peek functions (and related helpers) are used before we've fully unpacked the schema, so the metadata version is not yet known at that point.

So I think what needs to happen is: the first time a decoder peeks a header and observes the lack of a continuation, we set ArrowIpcDecoderPrivate::prefix_length = 4. Later when we decode the metadata version, we can raise an error if the prefix length is not as expected. Also we should raise if the prefix length is inconsistent in any future message peeked by the decoder.

As a pleasant side effect, there will be one fewer argument for you to pass around :D
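
The bookkeeping described above could be sketched roughly as follows. Only the `prefix_length` field name comes from this comment; `DecoderState`, `MetadataVersion`, `ObservePrefix`, and `CheckVersion` are hypothetical names for illustration. The version check here only rejects a 4-byte prefix on a non-V4 message; it deliberately does not require V4 messages to use a 4-byte prefix, because the reply that follows reports V4 golden files that do not.

```c
#include <stdint.h>

// Hypothetical stand-ins for the decoder's real types.
enum MetadataVersion { METADATA_V4 = 4, METADATA_V5 = 5 };

struct DecoderState {
  int32_t prefix_length;  // 0 until the first header is peeked, then 4 or 8
};

// Record the prefix length the first time a header is peeked and reject any
// later message whose prefix length differs. Returns 0 on success, -1 on error.
static int ObservePrefix(struct DecoderState* state, int32_t observed) {
  if (observed != 4 && observed != 8) return -1;

  if (state->prefix_length == 0) {
    state->prefix_length = observed;
    return 0;
  }

  return state->prefix_length == observed ? 0 : -1;
}

// Once the metadata version has been decoded, check it against the prefix
// length that was observed earlier. A missing continuation token (4-byte
// prefix) is only accepted for V4 in this sketch.
static int CheckVersion(const struct DecoderState* state,
                        enum MetadataVersion version) {
  if (state->prefix_length == 4 && version != METADATA_V4) return -1;
  return 0;
}
```

Storing the prefix length on the decoder means it no longer has to be threaded through each peek and decode call as a separate argument.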

paleolimbot · 2024-10-04

In a fun turn of events, not all V4 metadata messages have a 4-byte prefix (those in the 0.17.1 golden files do not). I think I included all the other checks you suggested! I also opened an issue with instructions for how to skip those cases in the future should we decide they don't need to be supported: #648

paleolimbot · 2024-10-07T16:01:54Z

I'm going to merge this to get the release candidate testing process started...feel free to leave comments and I can update the approach!

skip rust

1380e16

paleolimbot and others added 3 commits October 2, 2024 16:28

whoops

c98bf70

ergggh

5325089

make sure decoding the footer sets proper fields based on the schema

934dee7

paleolimbot mentioned this pull request Oct 3, 2024

GH-44297: [Integration][CI] Skip nanoarrow IPC integration tests for compressed/dictionary-encoded files apache/arrow#44298

Merged

revert metadata version change

b7df511

paleolimbot marked this pull request as ready for review October 4, 2024 14:11

bkietz requested changes Oct 4, 2024

paleolimbot and others added 4 commits October 4, 2024 12:30

maybe better errors for legacy streams

c195982

maybe fix

5c5d9a6

fix typo

6d9407a

actually fix

468cf73

paleolimbot mentioned this pull request Oct 4, 2024

Skip pre-1.0.0 IPC integration tests #648

Open

paleolimbot added 2 commits October 4, 2024 15:14

also check metadata version

caad98e

link issue in one more place

ec7a2a1

paleolimbot merged commit 19d35d7 into apache:main Oct 7, 2024
36 checks passed
